
    DSPatch: Dual Spatial Pattern Prefetcher

    High main memory latency continues to limit performance of modern high-performance out-of-order cores. While DRAM latency has remained nearly the same over many generations, DRAM bandwidth has grown significantly due to higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5) and 3D-stacked memory packaging (HBM). Current state-of-the-art prefetchers do not do well in extracting higher performance when higher DRAM bandwidth is available. Prefetchers need the ability to dynamically adapt to available bandwidth, boosting prefetch count and prefetch coverage when headroom exists and throttling down to achieve high accuracy when the bandwidth utilization is close to peak. To this end, we present the Dual Spatial Pattern Prefetcher (DSPatch) that can be used as a standalone prefetcher or as a lightweight adjunct spatial prefetcher to the state-of-the-art delta-based Signature Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of modulated spatial bit-patterns. The key idea is to: (1) represent program accesses on a physical page as a bit-pattern anchored to the first "trigger" access, (2) learn two spatial access bit-patterns: one biased towards coverage and another biased towards accuracy, and (3) select one bit-pattern at run-time based on the DRAM bandwidth utilization to generate prefetches. Across a diverse set of workloads, using only 3.6KB of storage, DSPatch improves performance over an aggressive baseline with a PC-based stride prefetcher at the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in memory-intensive workloads and up to 26%). Moreover, the performance of DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to 10% when DRAM bandwidth is doubled. Comment: This work is to appear in MICRO 201
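The three-step key idea above can be sketched in a few lines. This is an illustrative toy model only, not the paper's hardware design: the class name `DualPatternEntry`, the OR/AND learning rules, and the 0.75 bandwidth threshold are assumptions chosen to make the anchored-bit-pattern idea concrete.

```python
# Toy sketch of DSPatch's dual spatial bit-patterns (illustrative only).
# Accesses to a page are recorded as a bit-pattern rotated so the first
# "trigger" access maps to bit 0; a coverage-biased pattern (union of
# observed patterns) and an accuracy-biased pattern (intersection) are
# learned, and one is selected at run-time by DRAM bandwidth utilization.

PAGE_BLOCKS = 64  # 4 KB page / 64 B cache blocks

def anchored_pattern(accessed_offsets):
    """Bit-pattern of accessed blocks, anchored to the trigger access."""
    trigger = accessed_offsets[0]
    pattern = 0
    for off in accessed_offsets:
        pattern |= 1 << ((off - trigger) % PAGE_BLOCKS)
    return pattern

class DualPatternEntry:
    def __init__(self):
        self.cov_pattern = 0                        # coverage-biased: union
        self.acc_pattern = (1 << PAGE_BLOCKS) - 1   # accuracy-biased: intersection

    def learn(self, pattern):
        self.cov_pattern |= pattern   # OR grows coverage
        self.acc_pattern &= pattern   # AND keeps only consistently-used blocks

    def predict(self, bandwidth_util, threshold=0.75):
        # Throttle to the accurate pattern when DRAM bandwidth is near peak.
        return self.acc_pattern if bandwidth_util >= threshold else self.cov_pattern
```

With two observed access sequences on a page, the coverage pattern is their union and is issued when bandwidth headroom exists, while the accuracy pattern is their intersection and is issued when utilization is close to peak.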

    Bridging the Gap between Cosmic Dawn and Reionization favors Faint Galaxies-dominated Models

    It has been claimed that traditional models struggle to explain the tentative detection of the 21 cm absorption trough centered at z ~ 17 measured by the EDGES collaboration. On the other hand, it has been shown that the EDGES results are consistent with an extrapolation of a declining UV luminosity density, following a simple power-law, of deep Hubble Space Telescope observations of 4 < z < 9 galaxies. We here explore the conditions by which the EDGES detection is consistent with current reionization and post-reionization observations, including the neutral hydrogen fraction at z ~ 6-8, Thomson scattering optical depth, and ionizing emissivity at z ~ 5. By coupling a physically motivated source model derived from radiative transfer hydrodynamic simulations of reionization to a Markov Chain Monte Carlo sampler, we find that it is entirely possible to reconcile the high-redshift (cosmic dawn) and low-redshift (reionization) existing constraints. In particular, we find that a high contribution from low-mass halos along with high photon escape fractions is required to simultaneously reproduce cosmic dawn and reionization constraints. Our analysis further confirms that low-mass galaxies produce a flatter emissivity evolution, which leads to an earlier onset of reionization with a gradual and longer duration, resulting in a higher optical depth. While our faint-galaxies-dominated models successfully reproduce the measured globally averaged quantities over the first one billion years, they underestimate the late redshift-instantaneous measurements in efficiently star-forming and massive systems. We show that our (simple) physically motivated semi-analytical prescription produces results consistent with the (sophisticated) state-of-the-art THESAN radiation-magneto-hydrodynamic simulation of reionization. Comment: 14 pages, 6 figures. Accepted for publication in ApJ. Comments are welcome

    Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

    Long-latency load requests continue to limit the performance of high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: 1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and 2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy. The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: 1) accurately predict which load requests might go off-chip, and 2) speculatively fetch the data required by the predicted off-chip loads directly from the main memory, while also concurrently accessing the cache hierarchy for such loads. To enable Hermes, we develop a new lightweight, perceptron-based off-chip load prediction technique that learns to identify off-chip load requests using multiple program features (e.g., sequence of program counters). For every load request, the predictor observes a set of program features to predict whether or not the load would go off-chip. If the load is predicted to go off-chip, Hermes issues a speculative request directly to the memory controller once the load's physical address is generated. If the prediction is correct, the load eventually misses the cache hierarchy and waits for the ongoing speculative request to finish, thus hiding the on-chip cache hierarchy access latency from the critical path of the off-chip load. Our evaluation shows that Hermes significantly improves performance of a state-of-the-art baseline. 
We open-source Hermes. Comment: To appear in 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202
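The perceptron-based off-chip prediction described above can be sketched as follows. This is a toy model in the spirit of the abstract, not Hermes itself: the class name, number of features, table sizes, weight widths, and training margin are all hypothetical choices; the paper's exact feature set and parameters differ.

```python
# Toy perceptron-based off-chip load predictor (illustrative parameters).
# Each program feature (e.g., a hash of the recent PC sequence, the load
# PC) indexes a small table of saturating weights; the load is predicted
# to go off-chip if the weight sum crosses an activation threshold.

class PerceptronOffchipPredictor:
    def __init__(self, num_features=2, table_size=1024,
                 act_threshold=0, train_margin=8, wmax=15, wmin=-16):
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.act_threshold = act_threshold
        self.train_margin = train_margin
        self.wmax, self.wmin = wmax, wmin

    def _sum(self, features):
        return sum(t[f % self.table_size] for t, f in zip(self.tables, features))

    def predict(self, features):
        """True => load predicted to miss the on-chip cache hierarchy."""
        return self._sum(features) >= self.act_threshold

    def train(self, features, went_offchip):
        s = self._sum(features)
        mispredicted = (s >= self.act_threshold) != went_offchip
        # Update only on a misprediction or a low-confidence (small-margin) sum.
        if mispredicted or abs(s - self.act_threshold) <= self.train_margin:
            delta = 1 if went_offchip else -1
            for t, f in zip(self.tables, features):
                idx = f % self.table_size
                t[idx] = max(self.wmin, min(self.wmax, t[idx] + delta))
```

On a predicted off-chip load, the core would issue the speculative memory request as soon as the physical address is generated, in parallel with the normal cache lookup.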

    Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

    Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, and (iii) incurs very small area and power overheads on a modern high-end CPU. Comment: To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202
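The TLB-aware replacement decision can be illustrated with a toy model. Everything here is an assumption made for illustration, not Victima's actual policy: `TLBAwareSet`, the 10% pressure threshold, and the plain-LRU fallback are hypothetical stand-ins for the paper's mechanism.

```python
# Toy model of TLB-aware cache replacement (illustrative only): L2 blocks
# may hold either data or a cluster of TLB entries, and the victim choice
# avoids evicting TLB blocks when translation pressure (last-level TLB
# miss rate) is high.
from dataclasses import dataclass

@dataclass
class Block:
    kind: str      # 'data' or 'tlb' (a TLB block holds a cluster of entries)
    tag: int
    lru_age: int   # larger = more recently used

class TLBAwareSet:
    def __init__(self, blocks):
        self.blocks = blocks

    def victim(self, tlb_miss_rate, pressure_threshold=0.10):
        """Pick an eviction victim: LRU overall, but under high translation
        pressure, restrict the choice to data blocks so TLB entries stay."""
        candidates = self.blocks
        if tlb_miss_rate > pressure_threshold:
            data_blocks = [b for b in self.blocks if b.kind == 'data']
            if data_blocks:
                candidates = data_blocks
        return min(candidates, key=lambda b: b.lru_age)
```

The design point this captures is that a TLB block is only protected while translation pressure justifies it; when the TLB miss rate is low, the cache reverts to ordinary replacement and data reuse wins.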

    Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings

    Conventional virtual memory (VM) frameworks enable a virtual address to flexibly map to any physical address. This flexibility necessitates large data structures to store virtual-to-physical mappings, which leads to high address translation latency and large translation-induced interference in the memory hierarchy. On the other hand, restricting the address mapping so that a virtual address can only map to a specific set of physical addresses can significantly reduce address translation overheads by using compact and efficient translation structures. However, restricting the address mapping flexibility across the entire main memory severely limits data sharing across different processes and increases data accesses to the swap space of the storage device, even in the presence of free memory. We propose Utopia, a new hybrid virtual-to-physical address mapping scheme that allows both flexible and restrictive hash-based address mapping schemes to harmoniously co-exist in the system. The key idea of Utopia is to manage physical memory using two types of physical memory segments: restrictive and flexible segments. A restrictive segment uses a restrictive, hash-based address mapping scheme that maps virtual addresses to only a specific set of physical addresses and enables faster address translation using compact translation structures. A flexible segment employs the conventional fully-flexible address mapping scheme. By mapping data to a restrictive segment, Utopia enables faster address translation with lower translation-induced interference. Utopia improves performance by 24% in a single-core system over the baseline system, whereas the best prior state-of-the-art contiguity-aware translation scheme improves performance by 13%. Comment: To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202
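A restrictive, hash-based segment can be sketched as a set-associative mapping from virtual page numbers to a small group of candidate frames. This is a minimal sketch under stated assumptions, not Utopia's design: the set/way geometry, the XOR hash, and the fall-back behavior are hypothetical.

```python
# Toy restrictive segment (illustrative only): a virtual page may map only
# to one of a few physical frames selected by hashing its virtual page
# number (VPN), so translation just probes those candidates instead of
# walking a multi-level page table.

class RestrictiveSegment:
    def __init__(self, num_sets=1024, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # frames[s][w] records the VPN that owns physical frame s*ways + w.
        self.frames = [[None] * ways for _ in range(num_sets)]

    def _set_index(self, vpn):
        # Toy hash; a real design would use a hardware-friendly function.
        return (vpn ^ (vpn >> 10)) % self.num_sets

    def map_page(self, vpn):
        """Allocate a frame from the small set this VPN is allowed to use.
        Returns the physical frame number, or None if the set is full
        (the OS would then fall back to a flexible segment)."""
        s = self._set_index(vpn)
        for way in range(self.ways):
            if self.frames[s][way] is None:
                self.frames[s][way] = vpn
                return s * self.ways + way
        return None

    def translate(self, vpn):
        """Translate by probing only the few candidate frames (no PTW)."""
        s = self._set_index(vpn)
        for way in range(self.ways):
            if self.frames[s][way] == vpn:
                return s * self.ways + way
        return None
```

The compactness comes from the restriction itself: because the candidate frames are a function of the VPN, translation state per page shrinks to a way index plus ownership metadata, rather than a full arbitrary mapping.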

    Exploring real-world symptom impact and improvement in well-being domains for tardive dyskinesia in VMAT2 inhibitor-treated patients via clinician survey and chart review

    Introduction: Two vesicular monoamine transporter 2 (VMAT2) inhibitors are approved in the United States (US) for the treatment of tardive dyskinesia (TD). There is a paucity of information on the impact of VMAT2 inhibitor treatment on patient social and physical well-being. The study objective was to elucidate clinician-reported improvement in symptoms and any noticeable changes in social or physical well-being in patients receiving VMAT2 inhibitors. Methods: A web-based survey was offered to physicians, nurse practitioners, and physician assistants based in the US who prescribed valbenazine for TD within the past 24 months. Clinicians reported data from the charts of patients who met the inclusion criteria and were allowed to recall missing information. Results: Respondents included 163 clinicians who reviewed charts of 601 VMAT2-treated patients with TD: 47% had TD symptoms in ≥2 body regions, with the most common being in the head or face and upper extremities. Prior to treatment, 93% of patients showed impairment in ≥1 social domain, and 88% were impaired in ≥1 physical domain. Following treatment, among those with improvement in TD symptoms (n = 540), 80% to 95% showed improvement in social domains, 90% to 95% showed improvement in physical domains, and 73% showed improvement in their primary psychiatric condition. Discussion: In VMAT2-treated patients with TD symptom improvement, clinicians reported concomitant improvement in psychiatric disorder symptoms and in social and physical well-being. Regular assessment of TD impact on these types of domains should occur simultaneously with movement disorder ratings when evaluating the value of VMAT2 inhibitor therapy

    Search for continuous gravitational wave emission from the Milky Way center in O3 LIGO--Virgo data

    We present a directed search for continuous gravitational wave (CW) signals emitted by spinning neutron stars located in the inner parsecs of the Galactic Center (GC). Compelling evidence for the presence of a numerous population of neutron stars has been reported in the literature, turning this region into a very interesting place to look for CWs. In this search, data from the full O3 LIGO--Virgo run in the detector frequency band [10, 2000] Hz have been used. No significant detection was found, and 95% confidence-level upper limits on the signal strain amplitude were computed over the full search band, with the deepest limit of about 7.6 x 10^-26 at approximately 142 Hz. These results are significantly more constraining than those reported in previous searches. We use these limits to put constraints on the fiducial neutron star ellipticity and r-mode amplitude. These limits can also be translated into constraints in the black hole mass -- boson mass plane for a hypothetical population of boson clouds around spinning black holes located in the GC. Comment: 25 pages, 5 figures